Method and system for end-to-end image processing

ABSTRACT

A method of processing an input image comprises receiving the input image, storing the image in a memory, and accessing, by an image processor, a computer readable medium storing a trained deep learning network. A first part of the deep learning network has convolutional layers providing low-level features extracted from the input image, and convolutional layers providing a residual image. A second part of the deep learning network has convolutional layers for receiving the low-level features and extracting high-level features based on the low-level features. The method feeds the input image to the trained deep learning network, and applies a transformation to the residual image based on the extracted high-level features.

The work leading to this disclosure has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 335491 and grant agreement 757497.

RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 16/251,123 filed on Jan. 18, 2019. The contents of the above application are all incorporated by reference as if fully set forth herein in their entirety.

FIELD AND BACKGROUND OF THE INVENTION

The present invention, in some embodiments thereof, relates to image processing and, more particularly, but not exclusively, to a method and a system for end-to-end image processing.

The use of high resolution cameras in mobile phones has become increasingly popular. However, due to space constraints, their hardware is limited with respect to the pixel size and the quality of the optics. Moreover, mobile phones are usually hand-held, thus, not stable enough for long exposure times. Therefore, in these devices image signal processors (ISPs) employ various methods to compensate for these limitations.

Image signal processors perform image processing pipelines that re-sample and spatially filter (e.g., interpolate) raw image data. Such pipelines encompass a sequence of operations, ranging from low-level demosaicing, denoising and sharpening, to high-level image adjustment and color correction. Typically, each task is performed independently according to different heavily engineered algorithms per task.

SUMMARY OF THE INVENTION

According to an aspect of some embodiments of the present invention there is provided a method of processing an input image. The method comprises: receiving the input image, storing the image in a memory, and accessing, by an image processor, a computer readable medium storing a trained deep learning network. The deep learning network has a plurality of convolutional layers, wherein each of at least a portion of the layers has a first plurality of feature extraction channels, and a second plurality of channels storing residual images with correction of color values relative to a previous layer. The method comprises feeding the input image to the trained deep learning network, summing, for each of the at least the portion of the layers, a respective residual image with a residual image of the previous layer, and feeding the summation to a next layer of the trained deep learning network. The method additionally comprises generating on a display device an output showing a residual image contained in a final layer of the trained deep learning network.

According to some embodiments of the invention the trained deep learning network is trained to execute at least two low-level image processing tasks.

According to some embodiments of the invention the at least two low-level image processing tasks are denoising and demosaicing.

According to some embodiments of the invention the trained deep learning network comprises at least 15 layers.

According to some embodiments of the invention a sum of the first and the second plurality of channels is at least 32.

According to an aspect of some embodiments of the present invention there is provided a method of processing an input image. The method comprises receiving the input image, storing the image in a memory, and accessing, by an image processor, a computer readable medium storing a trained deep learning network. The deep learning network has a first part and a second part. The first part has convolutional layers providing low-level features extracted from the input image, and convolutional layers providing a residual image. The second part has convolutional layers for receiving the low-level features and extracting high-level features based on the low-level features. The method also comprises feeding the input image to the trained deep learning network, and applying a transformation to the residual image based on the extracted high-level features. The method additionally comprises generating on a display device an output showing a residual image contained in a final layer of the trained deep learning network.

According to some embodiments of the invention each of at least a portion of the layers of the first part has a first plurality of feature extraction channels, and a second plurality of channels storing residual images with correction of color values relative to a previous layer.

According to some embodiments of the invention the method comprises summing, for each of the at least the portion of the layers of the first part, a respective residual image with a residual image of the previous layer, and feeding the summation to a next layer of the first part.

According to some embodiments of the invention the low-level features comprise denoising features and demosaicing features.

According to some embodiments of the invention the input image is a raw image.

According to some embodiments of the invention the input image is a demosaiced image.

According to some embodiments of the invention the method further comprises preprocessing the image by applying a bilinear interpolation, prior to the feeding.

According to some embodiments of the invention the transformation comprises a non-linear function of color components of each pixel of the residual image.

According to some embodiments of the invention the applying the transformation is executed globally to all pixels of the residual image.

According to some embodiments of the invention the first part of the trained deep learning network comprises at least 15 layers.

According to an aspect of some embodiments of the present invention there is provided an image capturing and processing system. The system comprises an imaging device for capturing an image; and a hardware image processor for receiving the image and executing the method as delineated above and optionally and preferably as further detailed below.

According to some embodiments of the invention the system is a component in a device selected from the group consisting of a smartphone, a tablet and a smartwatch.

Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.

Implementation of the method and/or system of embodiments of the invention can involve performing or completing selected tasks manually, automatically, or a combination thereof. Moreover, according to actual instrumentation and equipment of embodiments of the method and/or system of the invention, several selected tasks could be implemented by hardware, by software or by firmware or by a combination thereof using an operating system.

For example, hardware for performing selected tasks according to embodiments of the invention could be implemented as a chip or a circuit. As software, selected tasks according to embodiments of the invention could be implemented as a plurality of software instructions being executed by a computer using any suitable operating system. In an exemplary embodiment of the invention, one or more tasks according to exemplary embodiments of method and/or system as described herein are performed by a data processor, such as a computing platform for executing a plurality of instructions. Optionally, the data processor includes a volatile memory for storing instructions and/or data and/or a non-volatile storage, for example, a magnetic hard-disk and/or removable media, for storing instructions and/or data. Optionally, a network connection is provided as well. A display and/or a user input device such as a keyboard or mouse are optionally provided as well.

BRIEF DESCRIPTION OF SEVERAL VIEWS OF THE DRAWINGS

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.

Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings and images. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.

FIGS. 1A-1E show images obtained during image processing experiments performed using a deep learning network according to some embodiments of the present invention;

FIG. 2 is a schematic illustration of an exemplified deep learning network, according to some embodiments of the present invention;

FIGS. 3A-3C show visual examples of restoration results obtained during a joint denoising and demosaicing experiment performed according to some embodiments of the present invention, using a low-level part of the deep learning network shown in FIG. 2;

FIGS. 4A-4D are images demonstrating removal of artifacts according to some embodiments of the present invention;

FIG. 5 shows peak signal-to-noise ratio (PSNR) performance as a function of the number of residual blocks, as obtained in experiments performed according to some embodiments of the present invention;

FIG. 6 shows PSNR performance as a function of the number of filters per layer, for a network with 20 layers, as obtained in experiments performed according to some embodiments of the present invention;

FIG. 7 is a graph showing loss as a function of epoch for deep-learning networks trained with and without skip connections, as obtained in experiments performed according to some embodiments of the present invention;

FIG. 8 shows a user interface presented to human evaluators during experiments performed according to some embodiments of the present invention;

FIG. 9 shows mean opinion score (MOS) results obtained in experiments performed according to some embodiments of the present invention;

FIG. 10 shows images obtained during image processing experiments performed using a deep learning network applied to well-lit input images according to some embodiments of the present invention;

FIG. 11 shows a comparison between images obtained using a deep learning network in which features are shared among the low-level and high-level parts of the network, and a deep learning network in which features are not shared among the low-level and high-level parts of the network;

FIGS. 12A-12D show additional images obtained during image processing experiments performed using a deep learning network according to some embodiments of the present invention;

FIG. 13 shows thumbnails of captured scenes used in experiments performed according to some embodiments of the present invention; and

FIG. 14 is a schematic illustration of a system having a hardware image processor, according to some embodiments of the present invention.

DESCRIPTION OF SPECIFIC EMBODIMENTS OF THE INVENTION

The present invention, in some embodiments thereof, relates to image processing and, more particularly, but not exclusively, to a method and a system for end-to-end image processing.

Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.

At least part of the operations described herein can be implemented by a data processing system, e.g., a dedicated circuitry or a general purpose computer, configured for receiving data and executing the operations described below. At least part of the operations can be implemented by a cloud-computing facility at a remote location.

Computer programs implementing the method of the present embodiments can commonly be distributed to users by a communication network or on a distribution medium such as, but not limited to, a flash memory device and a portable hard drive. From the communication network or distribution medium, the computer programs can be copied to a hard disk or a similar intermediate storage medium. The computer programs can be run by loading the code instructions either from their distribution medium or their intermediate storage medium into the execution memory of the computer, configuring the computer to act in accordance with the method of this invention. All these operations are well-known to those skilled in the art of computer systems.

Processing operations described herein may be performed by means of a processor circuit, such as a DSP, microcontroller, FPGA, ASIC, etc., or any other conventional and/or dedicated computing system.

The method of the present embodiments can be embodied in many forms. For example, it can be embodied on a tangible medium such as a computer for performing the method operations. It can be embodied on a computer readable medium, comprising computer readable instructions for carrying out the method operations. It can also be embodied in an electronic device having digital computer capabilities arranged to run the computer program on the tangible medium or execute the instructions on a computer readable medium.

Some embodiments of the present invention comprise a DL network that can be used within an image processing pipeline or instead of an image processing pipeline. The DL network of the present embodiments is referred to as DeepISP. In various exemplary embodiments of the invention the DeepISP jointly learns low-level corrections, such as demosaicing, denoising and sharpening, and higher level global image restoration in an end-to-end fashion. The inventors found that different tasks in this pipeline can be performed better when performed simultaneously. In addition, it has an advantage with respect to computational efficiency as computed features are shared between the different tasks.

FIG. 2 is a schematic illustration of the architecture of an exemplified DeepISP 10, according to some embodiments of the present invention. DeepISP 10 is composed of two parts, depicted at 12 and 14. Part 12 extracts low-level features and performs local modifications. Part 14 extracts higher level features and performs a global correction. The network is optionally and preferably fully convolutional, so as to accommodate any input image resolution.

Low-level features typically describe regions within the respective image but not their semantic meaning, e.g., irrespectively of the identification of objects or orientations of objects that are contained within the respective image. Low-level features typically describe individual pixels of the image or patches of pixels (e.g., patches having from 2 to 5 or from 2 to 10 pixels along their largest dimension). For example, low-level features can include information regarding color assignment for a mosaiced or preliminary demosaiced image, information regarding noise components within the image, information regarding textures, and information regarding local contrast and/or local gray scale or color variations or gradients. High-level features typically describe objects or orientations of objects that are contained within the respective image, optionally and preferably also classifying them according to semantic meaning. High-level features typically describe the entire image or groups of several patches. For example, high-level features can include information regarding the shapes and objects that are contained in the image, and optionally and preferably also regarding edges and lines that are contained in the image (albeit information pertaining to some of the edges and lines, or some portions of the edges and lines, e.g., corners, can be included in low-level features as well).

The term “feature,” as used herein, may refer to any characteristic of an item of data that may be used to determine whether the item of data falls within one or more specific categories of imagery information. Typically, a feature includes a measure or a value that can be used to distinguish one or more properties of a pixel, a patch, a group of patches or the entire image.

The low-level part 12 of DeepISP 10 optionally and preferably consists of N_(ll) blocks 16. Typical values for N_(ll) are from about 5 to about 50, more preferably from about 10 to about 40, more preferably from about 10 to about 30, more preferably from about 10 to about 20. In some Experiments performed by the present inventors, the value of N_(ll) was set to 15, and in some Experiments performed by the present inventors, the value of N_(ll) was set to 20.

Each intermediate block performs convolution with filters. The size of the filters is typically 3×3 and their stride is typically 1, but other filter sizes and strides are also contemplated in some embodiments of the present invention.

The input and output sizes of each block are M×N×C, where M and N are the input image dimensions, and C is the number of channels, which can be from about 32 to about 256, more preferably from about 32 to about 128, e.g., about 64. In some embodiments of the present invention the same input dimensions are maintained. This can be done, for example, by applying reflection padding to the input. The input to the network can be a raw image input, or, more preferably, an input demosaiced RGB image produced, for example, by a bilinear interpolation in a preprocessing stage. In the latter embodiment, part 12 can optionally and preferably apply additional demosaicing.
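The size-preserving behavior described above can be illustrated with a minimal sketch in plain Python. It assumes a single-channel image stored as a list of rows, a pad width of one pixel, and a 3×3 filter applied with stride 1; the helper names `reflect_pad` and `conv3x3` and the averaging filter are hypothetical and are not taken from the disclosure.

```python
def reflect_pad(image, pad=1):
    """Pad a 2-D list of rows by mirroring about the border pixels.

    The border pixel itself is not repeated. Valid only for pad < image size.
    """
    def reflect_index(i, size):
        # Map an out-of-range index back into [0, size) by reflection.
        if i < 0:
            return -i
        if i >= size:
            return 2 * (size - 1) - i
        return i

    rows, cols = len(image), len(image[0])
    return [
        [image[reflect_index(r, rows)][reflect_index(c, cols)]
         for c in range(-pad, cols + pad)]
        for r in range(-pad, rows + pad)
    ]


def conv3x3(image, kernel):
    """Apply a 3x3 filter with stride 1 and no padding (no kernel flip)."""
    rows, cols = len(image), len(image[0])
    return [
        [sum(kernel[i][j] * image[r + i][c + j]
             for i in range(3) for j in range(3))
         for c in range(cols - 2)]
        for r in range(rows - 2)
    ]


# Illustrative 3x4 single-channel image and an averaging filter (assumed values).
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
box = [[1 / 9] * 3 for _ in range(3)]

padded = reflect_pad(image, pad=1)   # 5 rows x 6 columns
out = conv3x3(padded, box)           # back to 3 rows x 4 columns
```

Padding by one pixel on each side exactly compensates for the two rows and two columns that a 3×3 filter removes, so the M×N dimensions of the block input are kept at the block output.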

For 3-color images (e.g., RGB images), at each layer, C-3 channels shown at 18 are feed-forward features (left column in FIG. 2), and the remaining 3 channels shown at 20 contain a correction for the RGB values of the previous block. The present embodiments also contemplate images of more colors, in which case at each layer, C-N_(c) channels 18 are feed-forward features, and the remaining N_(c) channels 20 contain a correction for the color values of the previous block, where N_(c) is the number of colors in the image.

The image channels 20 contain a residual image that is added to the estimation of the previous layer. The first block of the network has a similar structure but with only the N_(c) channels (3 in the present Example) of the input image (and not C channels as the other blocks). Unlike channels 18 that carry low-level features, channels 20 carry only the residual image (as corrected while being fed from one layer to the other), but not the low-level features (such as the low-level features described above) associated with it. Each of the N_(c) channels 20 optionally and preferably contains one of the colors of the residual image. The C-N_(c) channels 18 of a particular layer contain the features that are associated with the residual image contained by the N_(c) channels 20 of this particular layer. For example, for an RGB image, one of channels 20 can contain the red pixels of the residual image, one of channels 20 can contain the green pixels of the residual image, one of channels 20 can contain the blue pixels of the residual image, and the C-3 channels 18 can contain the low-level features associated with the residual RGB image. In some embodiments of the present invention, each of channels 18 contains information of a different low-level feature. However, this need not necessarily be the case, since, for some applications the low-level features are mixed, so that at least one of channels 18 contains two or more low-level features.

Many types of activation functions that are known in the art can be used in the blocks of part 12, including, without limitation, Binary step, Soft step, TanH, ArcTan, Softsign, Inverse square root unit (ISRU), Rectified linear unit (ReLU), Leaky rectified linear unit, Parametric rectified linear unit (PReLU), Randomized leaky rectified linear unit (RReLU), Exponential linear unit (ELU), Scaled exponential linear unit (SELU), S-shaped rectified linear activation unit (SReLU), Inverse square root linear unit (ISRLU), Adaptive piecewise linear (APL), SoftPlus, Bent identity, SoftExponential, Sinusoid, Sinc, Gaussian, Softmax and Maxout. A suitable activation function for the feature blocks 18 is ReLU or a variant thereof (e.g., PReLU, RReLU, SReLU), but other activation functions are also contemplated. A suitable activation function for the residual images 20 is TanH, but other activation functions are also contemplated.

Part 12 optionally and preferably applies small convolutions. Each block 16 produces a residual image. Unlike conventional techniques, where all residual images are accumulated at the last layer, DeepISP 10 performs the summation at each block, thus allowing the network to get at each level the current image estimate in addition to the calculated features. In FIG. 2, layers that output features are shown on the left side of parts 12 and 14, and layers that output images or residual images are shown on the right side of part 12 and the center of part 14.
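The per-block summation can be sketched for a single pixel as follows. This is a simplified illustration, not the trained network: the convolutions are omitted, and each block is represented only by a made-up vector of C pre-activation outputs. It assumes that the last N_(c) entries of that vector are the residual channels 20 (the ordering is an assumption), with ReLU applied to the feature channels 18 and TanH to the residual channels 20. The names `split_block_output` and `accumulate` are hypothetical.

```python
import math

N_C = 3  # number of color channels (RGB)


def split_block_output(channels):
    """Split C block outputs into C-N_C feature values and N_C residual values.

    Assumption: the residual channels are the last N_C entries.
    """
    features = [max(0.0, v) for v in channels[:-N_C]]   # ReLU on features
    residual = [math.tanh(v) for v in channels[-N_C:]]  # TanH on residual
    return features, residual


def accumulate(estimate, block_outputs):
    """Add every block's residual to the running estimate, block by block.

    In the network, each block would receive the updated estimate together
    with the features; here the block outputs are supplied directly.
    """
    features = []
    for channels in block_outputs:
        features, residual = split_block_output(channels)
        estimate = [e + r for e, r in zip(estimate, residual)]
    return features, estimate


# Two blocks with C = 5 (2 feature channels + 3 residual channels), made-up values.
block_outputs = [
    [0.7, -1.2, 0.10, 0.00, -0.10],
    [-0.3, 0.4, 0.05, 0.02, 0.00],
]
features, estimate = accumulate([0.5, 0.5, 0.5], block_outputs)
```

After the last block, `features` plays the role of the C-N_(c) feature channels and `estimate` plays the role of the current image estimate, which are the two quantities forwarded to the high-level part.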

The last block of the low-level part 12 forwards to the high-level part 14 the C-N_(c) feature channels in one path, and the currently estimated image (I) in another path. The high-level part 14 uses the features from the low-level part 12 for estimating a transformation W that is then applied to the image (I) to produce a global correction of the image.

Part 14 optionally and preferably includes a sequence of N_(hl) convolution layers 22 with filters. Typical values for N_(hl) are from about 2 to about 10, more preferably from about 2 to about 6, more preferably from about 2 to about 4. In some Experiments performed by the present inventors, the value of N_(hl) was set to 3.

Each block in part 14 performs convolution with filters. The size of the filters is typically 3×3 and their stride is typically 2, but other filter sizes and strides are also contemplated in some embodiments of the present invention.

Each layer is optionally and preferably followed by a max-pooling, which in some embodiments of the present invention is a 2×2 max-pooling. The purpose of the strides and pooling is getting a large receptive field and lowering the computational cost. A global mean-pooling 24 is optionally and preferably applied to the output of these convolutions, resulting in a single feature vector. This is optionally and preferably followed by a fully connected layer 26 that produces parameters of the transformation W.
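The reduction performed by part 14 can be followed with a short plain-Python sketch. It assumes one pixel of zero padding per convolution and non-overlapping 2×2 pooling windows (neither is stated above), and it uses toy two-channel feature maps and constant weights purely for illustration. All function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Spatial size after a strided convolution (padding=1 is an assumption)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out_size(size, window=2):
    """Spatial size after non-overlapping max-pooling."""
    return size // window


def high_level_spatial_size(size, n_layers=3):
    """Spatial size after n_layers of (3x3 stride-2 convolution + 2x2 pooling)."""
    for _ in range(n_layers):
        size = pool_out_size(conv_out_size(size))
    return size


def global_mean_pool(feature_maps):
    """Reduce each channel (a 2-D list) to its mean, giving one feature vector."""
    return [sum(map(sum, fm)) / (len(fm) * len(fm[0])) for fm in feature_maps]


def fully_connected(vector, weights, bias):
    """Dense layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def to_matrix(flat, rows=3, cols=10):
    """Reshape the 30 dense-layer outputs into the 3x10 operator W."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


size = high_level_spatial_size(256)                  # 256 -> 64 -> 16 -> 4
vector = global_mean_pool([[[1, 2], [3, 4]],         # toy channel 0
                           [[0, 0], [0, 8]]])        # toy channel 1
weights = [[0.1, 0.0] for _ in range(30)]            # toy 30x2 weight matrix
W = to_matrix(fully_connected(vector, weights, [0.0] * 30))
```

Because the global mean-pooling collapses whatever spatial size remains into one value per channel, the fully connected layer always sees a vector of fixed length, which is what lets this part accept any input resolution.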

A representative example of a function that can be used for the transformation is a quadratic function of the pixel's R, G, and B components, but other types of functions, preferably non-linear functions, are also contemplated. A quadratic function suitable for the present embodiments can be written as:

W·triu([r g b 1]^(T) ·[r g b 1]),   (1)

where triu(·) is the vectorized form of the elements in the upper triangular part of a matrix (to discard redundancies such as r·g and g·r). The operator W ∈ R^(3×10) maps the second-order monomials of each pixel to a new RGB value. The inventors found that such a family of transformations has the advantage of pairing raw low-light and processed well-lit images, a pairing for which linear regression is inadequate. As demonstrated in the Examples section that follows, a non-linear (e.g., quadratic) transformation produces pleasant looking images.
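Equation (1) can be made concrete with a small sketch. For a pixel (r, g, b), the upper triangular part of the 4×4 outer product of [r g b 1] with itself, read row by row, holds the ten monomials r², rg, rb, r, g², gb, g, b², b and 1. The row-by-row ordering is an assumption of this sketch; any fixed ordering works as long as W uses the same one. The names `monomials`, `apply_transform`, and `W_identity` are hypothetical.

```python
def monomials(r, g, b):
    """Upper-triangular entries of [r g b 1]^T [r g b 1], read row by row.

    Returns [r*r, r*g, r*b, r, g*g, g*b, g, b*b, b, 1].
    """
    v = (r, g, b, 1.0)
    return [v[i] * v[j] for i in range(4) for j in range(i, 4)]


def apply_transform(W, pixel):
    """Map one RGB pixel to a new RGB value with a 3x10 operator W."""
    m = monomials(*pixel)
    return tuple(sum(w * x for w, x in zip(row, m)) for row in W)


def apply_globally(W, image):
    """Apply the same W to every pixel of an image (a list of rows of pixels)."""
    return [[apply_transform(W, px) for px in row] for row in image]


# A W that leaves colors unchanged: each row selects the linear term r, g or b.
W_identity = [[0.0] * 10 for _ in range(3)]
W_identity[0][3] = 1.0   # new R <- r
W_identity[1][6] = 1.0   # new G <- g
W_identity[2][8] = 1.0   # new B <- b

image = [[(0.2, 0.4, 0.6), (0.1, 0.1, 0.1)]]
same = apply_globally(W_identity, image)
```

The transformation is non-linear in the pixel values but linear in the entries of W, so a single set of 30 numbers describes the correction for the whole image.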

Unlike conventional techniques in which a model is learned to predict a local transformation that is then applied to an input image, the transformation applied in DeepISP 10 is preferably global. The choice of global transformation is counterintuitive since it is known to be less suitable for learning classical local tone mapping used for HDR. The Inventors found, however, that when combined with the low-level part of the network, which applies local additive corrections, the usage of a global transformation is sufficient and enjoys better convergence and stability.

Known in the art are techniques for calculating a loss for image restoration using l₂-distance. While it optimizes mean squared error (MSE), which is directly related to the peak signal-to-noise ratio (PSNR), the Inventors found that it leads to inferior results with respect to perceptual quality compared to other loss functions.

When training the network only for the task of joint denoising and demosaicing (i.e., using only part 12), an l₂-loss can be used, and the performance can be expressed in terms of peak signal-to-noise ratio (PSNR). Yet, it is recognized that in the case of full ISP, PSNR is less suitable because a small deviation in the global color results in a very large error (and low PSNR), while having no effect on perceptual quality. In this case, a combination of the l₁ norm and a multi scale structural similarity index (MS-SSIM) can be used to obtain a higher perceptual quality.

A loss function for the full ISP case can be defined, for example, in the Lab domain. Because the network typically operates in the RGB color space, for calculating the loss, an RGB-to-Lab color conversion operator can be applied to the network output. This operator is differentiable and it is easy to calculate its gradient. The l₁-loss can be computed on all the three Lab channels. The MS-SSIM is optionally and preferably evaluated only on the luminance (L) channel:
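The relation between MSE and PSNR, and the sensitivity of PSNR to a uniform color offset, can be checked with a few lines of Python. The sketch uses the standard definition PSNR = 10·log₁₀(peak²/MSE); the pixel values, the offset of 0.1, and the peak value of 1.0 are illustrative assumptions.

```python
import math


def mse(a, b):
    """Mean squared error between two equally long sequences of values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB for values in the range [0, peak]."""
    err = mse(a, b)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)


reference = [0.2, 0.5, 0.8, 0.4]
shifted = [v + 0.1 for v in reference]   # uniform global offset of 0.1
value = psnr(reference, shifted)         # about 20 dB
```

Every pixel in `shifted` carries the same small offset, so all local structure is preserved, yet the PSNR is only about 20 dB.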

Loss(Î, I) = (1-α)∥Lab(Î)-Lab(I)∥₁ + α·MS-SSIM(L(Î), L(I)).   (2)

The reasoning behind this design choice is to allow the network to learn both local (captured by MS-SSIM) and global (enforced by l₁) corrections. Applying MS-SSIM to the luminance channel allowed learning local luminance corrections even before the color (a and b channels) has converged to the target value. Also, MS-SSIM is based on local statistics and is mostly affected by the higher frequency information, which is of lower significance in the color channels.
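The structure of equation (2) can be sketched as follows. Several choices here are assumptions that the text does not fix: the RGB values are treated as sRGB with a D65 white point, the l₁ term is averaged over pixels and channels rather than summed, α is set to 0.5, and the MS-SSIM term is taken as a number supplied by the caller, since a full multi-scale SSIM implementation is beyond a short example. The function names are hypothetical.

```python
def srgb_to_lab(r, g, b):
    """Convert one sRGB pixel (components in [0, 1]) to CIE Lab (D65)."""
    def linearize(c):
        # Undo the sRGB transfer curve.
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = linearize(r), linearize(g), linearize(b)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3.0 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def lab_l1(pred, target):
    """Mean absolute Lab difference over all pixels and all three channels."""
    total = 0.0
    for p, t in zip(pred, target):
        total += sum(abs(a - b)
                     for a, b in zip(srgb_to_lab(*p), srgb_to_lab(*t)))
    return total / (3 * len(pred))


def combined_loss(pred, target, ms_ssim_term, alpha=0.5):
    """Weighted sum of the Lab l1 term and a caller-supplied MS-SSIM term."""
    return (1.0 - alpha) * lab_l1(pred, target) + alpha * ms_ssim_term


pred = [(0.2, 0.4, 0.6), (0.9, 0.9, 0.9)]      # flat list of RGB pixels
target = [(0.25, 0.4, 0.55), (0.9, 0.9, 0.9)]
loss = combined_loss(pred, target, ms_ssim_term=0.2, alpha=0.5)
```

Here the l₁ term sees all three Lab channels, while the MS-SSIM term would be computed from the L channel only, mirroring the split between global color agreement and local luminance structure described above.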

FIG. 14 is a schematic illustration of a system 130 having a hardware image processor 132, which typically comprises an input/output (I/O) circuit 134, a hardware central processing unit (CPU) 136 (e.g., a hardware microprocessor), and a hardware memory 138 which typically includes both volatile memory and non-volatile memory. CPU 136 is in communication with I/O circuit 134 and memory 138. System 130 preferably comprises a graphical user interface (GUI) 142 in communication with processor 132. I/O circuit 134 preferably communicates information in appropriately structured form to and from GUI 142. Further shown is an imaging device 146, such as a digital camera, that is in communication with processor 132.

GUI 142 and processor 132 can be integrated together within the same housing, for example, in a smartphone device, or they can be separate units communicating with each other. Similarly, imaging device 146 and processor 132 can be integrated together within the same housing or they can be separate units communicating with each other.

GUI 142 can optionally and preferably be part of a system including a dedicated CPU and I/O circuits (not shown) to allow GUI 142 to communicate with processor 132. Processor 132 issues to GUI 142 graphical and textual output generated by CPU 136. Processor 132 also receives from GUI 142 signals pertaining to control commands generated by GUI 142 in response to user input. GUI 142 can be of any type known in the art, such as, but not limited to, a touch screen, and the like. In preferred embodiments, GUI 142 is a GUI of a mobile device such as a smartphone, a tablet, a smartwatch and the like. When GUI 142 is a GUI of a mobile device, the CPU circuit of the mobile device can serve as processor 132 and can execute the code instructions described herein.

System 130 can further comprise one or more computer-readable storage media 144. Medium 144 is preferably a non-transitory storage medium storing computer code instructions as further detailed herein, and processor 132 executes these code instructions. The code instructions can be run by loading the respective code instructions into the execution memory 138 of processor 132. Storage medium 144 preferably also stores a trained DeepISP as further detailed hereinabove.

Storage medium 144 can store program instructions which, when read by the processor 132, cause the processor to receive an input image and to execute the method as described herein. In some embodiments of the present invention, an input image is generated by imaging device 146 and is transmitted to processor 132 by means of I/O circuit 134. Processor 132 can apply the trained DeepISP of the present embodiments to provide an output image as further detailed hereinabove. Processor 132 can display the output image on GUI 142, store it in storage medium 144, and/or transmit it to a remote location (e.g., upload it to a cloud storage facility, transmit it directly to another system, such as, but not limited to, another smartphone, or the like).

As used herein the term “about” refers to ±10%.

The word “exemplary” is used herein to mean “serving as an example,instance or illustration.” Any embodiment described as “exemplary” isnot necessarily to be construed as preferred or advantageous over otherembodiments and/or to exclude the incorporation of features from otherembodiments.

The word “optionally” is used herein to mean “is provided in someembodiments and not provided in other embodiments.” Any particularembodiment of the invention may include a plurality of “optional”features unless such features conflict.

The terms “comprises”, “comprising”, “includes”, “including”, “having”and their conjugates mean “including but not limited to”.

The term “consisting of” means “including and limited to”. The term“consisting essentially of” means that the composition, method orstructure may include additional ingredients, steps and/or parts, butonly if the additional ingredients, steps and/or parts do not materiallyalter the basic and novel characteristics of the claimed composition,method or structure.

As used herein, the singular form “a”, “an” and “the” include pluralreferences unless the context clearly dictates otherwise. For example,the term “a compound” or “at least one compound” may include a pluralityof compounds, including mixtures thereof.

Throughout this application, various embodiments of this invention maybe presented in a range format. It should be understood that thedescription in range format is merely for convenience and brevity andshould not be construed as an inflexible limitation on the scope of theinvention. Accordingly, the description of a range should be consideredto have specifically disclosed all the possible subranges as well asindividual numerical values within that range. For example, descriptionof a range such as from 1 to 6 should be considered to have specificallydisclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numberswithin that range, for example, 1, 2, 3, 4, 5, and 6. This appliesregardless of the breadth of the range.

Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases “ranging/ranges between” a first indicated number and a second indicated number and “ranging/ranges from” a first indicated number “to” a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.

It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

Various embodiments and aspects of the present invention as delineated hereinabove and as claimed in the claims section below find experimental support in the following examples.

EXAMPLES

Reference is now made to the following examples, which together with the above descriptions illustrate some embodiments of the invention in a non-limiting fashion.

The following Examples present a full end-to-end deep neural model of the camera image signal processing (ISP) pipeline, referred to as DeepISP. A deep learning (DL) procedure learns a mapping from the raw low-light mosaiced image to the final visually compelling image, and encompasses both low-level tasks such as demosaicing and denoising and higher-level tasks such as color correction and image adjustment. The training and evaluation of the pipeline described in the following examples were performed on a dedicated dataset containing pairs of low-light and well-lit images captured by a Samsung® S7 smartphone camera in both raw and processed JPEG formats. The following examples demonstrate state-of-the-art performance in objective evaluation of PSNR on the subtask of joint denoising and demosaicing. For the full end-to-end pipeline, the following examples demonstrate improved visual quality compared to the manufacturer ISP, both in a subjective human assessment and when rated by a deep model trained for assessing image quality.

DL-based methods, and more specifically convolutional neural networks (CNNs), have demonstrated considerable success in such image processing tasks. For example, these models have produced state-of-the-art results for demosaicing [1], [2], denoising [3]-[9], deblurring [10]-[15] and super-resolution [16]-[20]. Traditional image processing algorithms commonly rely on hand-crafted heuristics, which require explicitly defining the prior on natural image statistics. Some examples of priors used in the literature are: a sparse representation in a redundant dictionary [21], local smoothness [22] and non-local similarity [23]. An advantage of DL-based methods is their ability to implicitly learn the statistics of natural images. Recent research has demonstrated that CNNs are inherently good at generating high-quality images, even when operating outside the supervised learning regime, e.g., [24] and [25].

Some studies have explored the application of deep learning for other image enhancement tasks. For example, the mapping between pairs of dark and bright JPEG images is learned in [26]. Another example is learning a mapping from mobile camera images to DSLR images [27]. These works, however, do not perform end-to-end processing. Rather, they begin from an image already processed by an ISP.

Conditional generative adversarial networks (CGANs) are another common approach for image enhancement. These models consist of a generator and a discriminator. The generator maps the source domain distribution to an output domain distribution given an input image. The learning is accomplished by having a discriminator that learns to distinguish between generated images and real images and optimizing the generator to fool the discriminator. In one example, color correction for underwater images was learned using such an adversarial loss (in addition to other loss terms) [28]. More examples of generative adversarial networks used for image restoration are super-resolution [18] and blind deblurring [15]. A main limitation of generative adversarial networks is that they are not very stable in training and tend to suffer from mode collapse, so only a subset of the domain distribution is generated. For this reason, the use of generative adversarial networks for image enhancement requires adding other loss terms.

In contrast to the traditional approach, which solves the sequence of tasks performed in the standard ISP independently, DL allows multiple tasks to be solved jointly, with great potential to alleviate the total computational burden. Current algorithms were only able to accomplish this for closely related tasks, such as denoising and demosaicing [29] or super-resolution and demosaicing [30]. These studies have shown the advantage of jointly solving different tasks. The present examples demonstrate the ability to jointly learn the full image processing pipeline in an end-to-end fashion. Such an approach enables sharing information (features) between parts of the network that perform different tasks, which improves the overall performance compared to solving each problem independently.

Joint Denoising and Demosaicing

A. Evaluation

The DeepISP 10 was first evaluated on the task of joint denoising and demosaicing. The MSR demosaicing dataset [29] is generated by down-sampling a mosaiced image, so each pixel has its ground truth red, green and blue values. The noise in this dataset is designed to be realistic: the level of noise is estimated in the original image and applied to the down-sampled image. The standard deviation was measured for the mosaiced noisy images compared to their corresponding ground truth values. The obtained STD range was σ ∈ [1,10]. For the task of joint denoising and demosaicing, the Panasonic images in the MSR dataset were used for training, and the results are reported herein for both the Panasonic and Canon test sets (disjoint from the training sets).

As the denoising and demosaicing task requires only local image modifications, only the low-level part 12 of the network (the output of the last residual block) was used as the model output. The number of blocks was set to N_(ll)=20. The mosaiced raw image was transformed to an RGB image by bilinear interpolation during the preprocessing stage. The test set was retained as specified in the dataset and the remaining 300 images were split into 270 for training and 30 for validation. The resolution of all images was 132×220; although some were captured in portrait mode and some in landscape mode, all images were used in landscape orientation. The data were further augmented with random horizontal and vertical flipping. The network was trained for 5000 epochs using the Adam optimizer with learning rate 5×10⁻⁵, β₁=0.9, β₂=0.999 and ε=10⁻⁸.
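For orientation only, the optimizer settings quoted above can be placed in the standard Adam update rule. The following is a minimal scalar sketch, not the training code used in this Example; the function name adam_step and the toy gradient value are illustrative assumptions.

```python
import math

# Hyper-parameters as quoted in the text.
LEARNING_RATE = 5e-5
BETA1, BETA2, EPSILON = 0.9, 0.999, 1e-8

def adam_step(param, grad, m, v, step):
    """One standard Adam update for a scalar parameter; `step` counts from 1."""
    m = BETA1 * m + (1 - BETA1) * grad          # running mean of gradients
    v = BETA2 * v + (1 - BETA2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - BETA1 ** step)             # bias correction
    v_hat = v / (1 - BETA2 ** step)
    param = param - LEARNING_RATE * m_hat / (math.sqrt(v_hat) + EPSILON)
    return param, m, v

# Toy usage (hypothetical values): the first step moves the parameter by
# roughly the learning rate, whatever the gradient magnitude.
p, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, step=1)
print(1.0 - p)
```

Because the update is normalized by the running gradient magnitude, the step size is governed mainly by the learning rate, which may explain why the same settings could be reused across experiments.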

Some visual examples of the restoration results are shown in FIGS. 3A-C, where FIG. 3A shows output examples, FIG. 3B shows input (bilinear) examples, and FIG. 3C shows the corresponding ground truth images. The artifacts caused by the interpolation are visible in FIG. 3B, and the network of the present embodiments learns how to remove them. A known challenge in demosaicing is the Moiré artifact, which is particularly observed in image locations with high frequency patterns. FIGS. 4A-D demonstrate how well the method of the present embodiments handles this artifact. Shown in FIGS. 4A-D are a ground truth image (FIG. 4A), an input (bilinear) image (FIG. 4B), an image produced by the method in [2] (FIG. 4C), and an image produced by the method of the present embodiments (FIG. 4D). As shown, the method of the present embodiments removes the artifact even in cases where other methods fail (e.g., see the blue artifact on the far building in FIG. 4C).

Table 1 below summarizes the comparison to other methods on the MSR dataset. Apart from the last row, all numbers in Table 1 are taken from [2]. The results obtained using the method of the present embodiments are given in the last row of Table 1. The method of the present embodiments achieves the best results for joint denoising and demosaicing on both the Panasonic and Canon test sets in the MSR dataset. Compared to the previous state-of-the-art results (SEM by [2]), the method of the present embodiments produces an improvement of 0.38 dB (linear space) and 0.72 dB (sRGB space) on the Panasonic test set, and of 0.61 dB (linear) and 1.28 dB (sRGB) on the Canon test set.

TABLE 1

PSNR (dB)               Panasonic           Canon
Method                  Linear    sRGB      Linear    sRGB
Matlab [34]             34.16     27.56     36.38     29.1
OSAP [35]               36.25     29.93     39        31.95
WECD [36]               36.51     30.29     -         -
NLM [37]                36.55     30.56     38.82     32.28
DMMSE [38]              36.67     30.24     39.48     32.39
LPA [39]                37        30.86     39.66     32.84
CS [40]                 37.2      31.41     39.82     33.24
JMCDM [41]              37.44     31.35     39.49     32.41
RTF [29]                37.77     31.77     40.35     33.82
FlexISP [42]            38.28     31.76     40.71     33.44
DJDD [1]                38.6      32.6      -         -
SEM [2]                 38.93     32.93     41.09     34.15
DeepISP 10 (Part 12)    39.31     33.65     41.7      35.43
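As a worked check of the improvements quoted for Table 1, the sketch below recomputes the gains of the last row over the SEM row and converts each gain into the corresponding mean-squared-error ratio. It assumes the usual definition PSNR = 10·log10(peak²/MSE); the helper names psnr and mse_ratio are illustrative only.

```python
import math

def psnr(mse: float, peak: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(peak * peak / mse)

def mse_ratio(gain_db: float) -> float:
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

# Columns: Panasonic linear, Panasonic sRGB, Canon linear, Canon sRGB (Table 1).
table = {
    "SEM": (38.93, 32.93, 41.09, 34.15),
    "DeepISP": (39.31, 33.65, 41.70, 35.43),
}
gains = [round(d - s, 2) for d, s in zip(table["DeepISP"], table["SEM"])]
print(gains)  # [0.38, 0.72, 0.61, 1.28]
print([round(mse_ratio(g), 3) for g in gains])
```

Under this definition, a gain of 1.28 dB corresponds to an MSE lower by a factor of about 1.34.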

The experiment described in this Example corroborates the ability of the network of the present embodiments to generalize well to a different dataset (training on Panasonic images and testing on Canon images). This strength is also noticeable in the improvement of 0.71 dB/1.05 dB obtained over another deep learning based method [1] on Linear/sRGB Panasonic. Note that only the MSR dataset, which contains a few hundred images, was used for training, while the training procedure in [1] uses, in addition, an external dataset with millions of images for mining hard examples and training on them.

B. Hyper-Parameters

The large number of residual blocks used in network 10 has two main effects. Firstly, it allows the network to learn more complicated functions with more parameters and more nonlinear units. Secondly, it generates a larger receptive field. For example, for N 3×3 convolution layers each pixel at the output is a function of (2N+1)×(2N+1) neighboring input pixels.
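The receptive-field relation can be verified with a few lines of arithmetic. The sketch below assumes stride-1 convolutions without dilation or pooling, which is what the (2N+1)×(2N+1) expression implies; the function name receptive_field is illustrative.

```python
def receptive_field(num_layers: int, kernel_size: int = 3) -> int:
    """Side length of the receptive field of stacked stride-1, undilated convolutions."""
    # Each k x k layer widens the field by (k - 1) pixels in each dimension.
    return 1 + num_layers * (kernel_size - 1)

for n in (1, 5, 20):
    side = receptive_field(n)
    print(f"N={n}: {side} x {side} input pixels")
```

For N=20 layers of 3×3 convolutions this gives a 41×41 neighborhood, consistent with 2N+1.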

FIG. 5 shows PSNR performance as a function of the number of residual blocks. Performance increases for deeper networks, reaching convergence, or diminishing returns, at about 16 layers.

FIG. 6 shows PSNR performance as a function of the number of filters per layer, for a network with 20 layers. The number of filters per layer affects the expressiveness of the model. Convergence is reached at about 64 filters per layer. Note that increasing the number of filters by a factor a results in a factor a² in the number of parameters, while the number of parameters scales only linearly with the number of layers.
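The scaling argument can be illustrated by counting parameters in a simplified stack of convolution layers. The sketch assumes that every layer has the same number of input and output channels and a 3×3 kernel, and it ignores the first and last layers, whose channel counts differ; the function name conv_stack_params is hypothetical.

```python
def conv_stack_params(num_layers: int, filters: int, kernel_size: int = 3) -> int:
    """Weights plus biases for conv layers with `filters` input and output channels."""
    per_layer = kernel_size * kernel_size * filters * filters + filters
    return num_layers * per_layer

base = conv_stack_params(num_layers=20, filters=64)
wider = conv_stack_params(num_layers=20, filters=128)   # filters scaled by a = 2
deeper = conv_stack_params(num_layers=40, filters=64)   # layers scaled by 2

print(wider / base)   # close to a**2 = 4 (biases grow only linearly)
print(deeper / base)  # exactly 2
```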

C. Effect of Skip Connections

Training very deep networks suffers from convergence problems due to vanishing or exploding gradients. Vanishing gradients occur when the gradient of the loss with respect to a parameter is too small to have any effect. Exploding gradients are the result of accumulated error in the calculation of the update step. Both are more apparent in very deep networks because there is a long path of layers between the loss and the first layers of the network, which implies many multiplications in the backward pass that are very likely to either converge to zero or explode.

Skip connections, or “residual blocks”, were suggested in [43] as a way of having shorter paths from the output of the network to the first layers. These blocks compute the residual features and have proven to be very successful for classification models. The present Inventors found that the same holds when residual blocks are used in regression networks, as in the model of the present embodiments. To show the advantage of skip connections, a model from which the skip connections had been removed was trained. FIG. 7 shows the loss as a function of the epoch for the model with and without skip connections. Without skip connections the training is not stable, and after 5000 epochs it reaches a loss two orders of magnitude higher than the same model with skip connections.
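The effect of a skip connection on the backward pass can be illustrated with a deliberately simplified scalar model. In the sketch below each layer is assumed to have the same local derivative; a plain layer y = f(x) contributes the factor f'(x) to the chain rule, while a residual layer y = x + f(x) contributes 1 + f'(x). The function name and the numerical values are illustrative assumptions and do not model network 10.

```python
def chain_gradient(local_derivative: float, depth: int, skip: bool) -> float:
    """Product of per-layer derivatives along a chain of `depth` scalar layers."""
    grad = 1.0
    for _ in range(depth):
        # A residual layer adds the identity path to the layer's own derivative.
        factor = local_derivative + (1.0 if skip else 0.0)
        grad *= factor
    return grad

plain = chain_gradient(0.05, depth=20, skip=False)
residual = chain_gradient(0.05, depth=20, skip=True)
print(f"plain: {plain:.3e}, residual: {residual:.3f}")
```

In this toy setting the gradient through 20 plain layers is vanishingly small, whereas the identity path keeps the residual gradient of order one.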

Full ISP

A. S7-ISP Dataset

To assess the collective performance of both parts 12 and 14 of network 10, a dataset of real-world images was generated. For this purpose, different scenes were captured using a Samsung S7 rear camera. A special Android application was developed to capture a sequence of images while the Samsung S7 device is on a tripod and without having to touch it (to avoid camera movement).

While the scenes were chosen to contain minimal motion, a total lack of motion during the acquisition process cannot be guaranteed because the capturing was not performed in a lab setting. For each scene, a JPEG image was captured using the camera's fully automatic mode and the original raw image was saved as well. In addition, a low-light image of the same scene was captured, and stored in both JPEG and raw formats. The low-light image was emulated by capturing the same scene with the exact same settings as those chosen by the camera in the automatic mode, except for the exposure time, which was set to a quarter of the automatic setting. Since the camera supports only a discrete set of predefined exposure times, the closest supported value was selected.

A total of 110 scenes were captured and split into 90, 10 and 10 for the training, validation and test sets, respectively. The relatively small number of images in the dataset was compensated by their 3024×4032 (12 Mpixel) resolution. Thus, when training on patches, this dataset effectively contains many different samples. Even for relatively large 256×256 patches, it effectively contains over 20 thousand non-overlapping patches (and more than a billion different patches). The scenes captured include indoor and outdoor scenes, under sunlight and artificial light. Thumbnails of the captured scenes are displayed in FIG. 13.
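The patch counts can be reproduced with simple arithmetic. The sketch below assumes that the non-overlapping figure is an area-based estimate over all 110 scenes, and that the number of different patches counts every possible top-left position of a 256×256 window; both interpretations are assumptions, since the counting rule is not spelled out.

```python
HEIGHT, WIDTH = 3024, 4032   # image resolution
PATCH = 256                  # patch side length
SCENES = 110                 # total number of captured scenes

# Area-based estimate of non-overlapping patches over the whole dataset.
by_area = HEIGHT * WIDTH / (PATCH * PATCH) * SCENES

# Stricter count that keeps only whole tiles inside each image.
whole_tiles = (HEIGHT // PATCH) * (WIDTH // PATCH) * SCENES

# Every distinct window position (overlapping patches allowed).
positions = (HEIGHT - PATCH + 1) * (WIDTH - PATCH + 1) * SCENES

print(f"area-based: {by_area:.0f}, whole tiles: {whole_tiles}, positions: {positions}")
```

The area-based estimate is slightly above 20 thousand, the whole-tile count is somewhat lower, and the number of distinct window positions exceeds one billion.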

B. Mean Opinion Score

To account for the fact that it is difficult to define an objective metric for the full pipeline, a subjective evaluation was performed. To this end, the mean opinion score (MOS) was generated for each image using Amazon Mechanical Turk to quantitatively assess its quality. Two types of experiments were performed. The first experiment involved full images, where human evaluators were presented with a single image and were asked to rate its quality on a scale from 1 (bad) to 5 (excellent). In the second experiment, for rating the quality of details, evaluators were presented with multiple versions of the same patch side by side and were asked to rate each of them. The displayed patch size was set to 512×512 pixels (about 2% of the total image). Evaluators were instructed to rate the image quality according to factors like natural colors, details and sharpness (not the content of the image, e.g., composition). Each evaluator could rate a specific example only once, but the set of evaluators was not the same for all examples. The user interface presented to the evaluators is shown in FIG. 8. Each of the three patches at the lower part of FIG. 8 is accompanied by a rating scale of: 1 Bad, 2 Poor, 3 Fair, 4 Good and 5 Excellent.

In addition to scoring by humans, image quality was also evaluated by a learned model from [44] that was trained to estimate human evaluations. The model output was normalized to the range [1,5].

C. DeepISP Evaluation

The DeepISP 10 was tested on the challenging task of learning the mapping from low-light raw input images to well-lit JPEG images produced by the Samsung S7 ISP in automatic setting. The mosaiced raw image was transformed to RGB by bilinear interpolation as a preprocessing stage. DeepISP 10 was used with N_(ll)=15 and N_(hl)=3. For the MS-SSIM part of the loss, patches of 5×5 were used at two scales. The network was trained with a batch containing a single 1024×1024 patch cropped at random at each epoch from one of the training images. The data were augmented with random horizontal flipping. The training lasted for 700 epochs using the Adam optimizer with the following parameters: a learning rate of 5×10⁻⁵, β₁=0.9, β₂=0.999 and ε=10⁻⁸.

For faster convergence, the parameters of the learned operator W were initialized with an affine operator W_(init) ∈ R^(3×4). In this initialization, W_(init), only the first-order monomials of each pixel were mapped to a new RGB value, so it did not contain the elements that correspond to second-order monomials (they were initialized to zero in W). Linear regression from input pixels to target pixels was applied for each sample in the training set to get such an affine operator. As several operators were obtained in this way, W_(init) was set to their average. Because of this averaging operation, the affine transformation W_(init) ∈ R^(3×4) was used as the initialization of the full operator W ∈ R^(3×10), with the non-linear coefficients of W set to zero. Unlike the affine operator, an average of multiple full operators did not lead to a reasonable operator. In other words, an average of several transforms in R^(3×10) did not generate plausible images and did not serve as a good starting point for the optimization.
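The initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function names are hypothetical, each training sample is assumed to be given as two arrays of corresponding RGB pixels, and the ordering of the ten columns of W (the three first-order terms and the constant first, followed by six second-order terms) is assumed for illustration only.

```python
import numpy as np

def fit_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares 3x4 affine map from src pixels (n, 3) to dst pixels (n, 3)."""
    ones = np.ones((src.shape[0], 1))
    features = np.hstack([src, ones])                     # (n, 4): r, g, b, 1
    solution, *_ = np.linalg.lstsq(features, dst, rcond=None)
    return solution.T                                     # (3, 4)

def init_full_operator(pairs) -> np.ndarray:
    """Average the per-sample affine fits and embed the result in a 3x10 operator."""
    w_init = np.mean([fit_affine(src, dst) for src, dst in pairs], axis=0)
    w_full = np.zeros((3, 10))
    # Assumed column order: [r, g, b, 1] first; second-order columns stay zero.
    w_full[:, :4] = w_init
    return w_full

# Toy usage with synthetic pixels generated by a known affine map.
rng = np.random.default_rng(0)
true_affine = rng.normal(size=(3, 4))
pairs = []
for _ in range(3):
    src = rng.random((50, 3))
    dst = src @ true_affine[:, :3].T + true_affine[:, 3]
    pairs.append((src, dst))
w = init_full_operator(pairs)
print(w.shape)
```

Averaging is applied to the 3×4 affine fits only, in line with the observation above that averaging full 3×10 operators did not yield a reasonable starting point.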

It is recognized that a camera ISP can typically handle motion artifacts, which are missing in the dataset used in this Example. Nevertheless, learning to generate high quality images with a shorter exposure time can help mitigate such artifacts.

To evaluate the reconstruction results, the mean opinion score (MOS) was used. It was generated using Amazon Mechanical Turk for both full images and patches, as specified above. A total of 200 ratings were collected for each image (200 per version of an image, i.e., DeepISP output, Samsung S7 output and the well-lit ground truth): 100 ratings for 10 random patches and an additional 100 for the full image.

FIG. 9 presents the MOS results. The leftmost group of bars shows average human rating scores for random 512×512 patches, the middle group of bars shows average human rating scores of the full images, and the rightmost group of bars shows the rating generated by a deep learning-based method for evaluating image quality [44]. In each group of bars, the leftmost bar corresponds to the Samsung S7 ISP output for low-light images, the middle bar corresponds to the output of network 10 of the present embodiments for the same raw image, and the rightmost bar corresponds to the Samsung S7 ISP output for a well-lit scene, which serves as a ground truth. As shown, at the patch level, the DeepISP MOS is 2.86, compared to 2.71 for the Samsung S7 ISP on the same images. The DeepISP MOS for full images is 4.02, compared to 3.74 achieved by the Samsung S7 ISP. The former result is only slightly inferior to the MOS of 4.05 that is given to the well-lit images.

It is evident that the visual quality score predicted by DeepIQA [44] corresponds well to the human evaluation, with scores of 3.72, 3.92 and 4.02 for the Samsung S7 ISP, DeepISP and the well-lit scene, respectively. FIGS. 1A-E and 12A-D present a selection of visual results. Since Samsung's low-light images are quite dark (to suppress visible noise), all images are presented (and have been evaluated) after a simple histogram stretching to have a fair comparison. The luminance channel histogram has been stretched to cover the range [0, 255] with 5% saturation at the high and low boundaries of the range. Samsung's original low-light images (after the Samsung ISP but before the histogram stretching applied in this Example) are rated about 1 point lower on the MOS compared to the same images after histogram stretching. FIGS. 1A and 12A show exemplified ground truth well-lit images, FIGS. 1B and 12B show the corresponding raw input low-light images (for visualization purposes, after demosaicing by bilinear interpolation), FIGS. 1C and 12C show the corresponding output of the Samsung S7 ISP (after histogram stretch), and FIGS. 1D and 12D show the corresponding output of network 10.
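A histogram stretch of the kind described above can be sketched in a few lines. The sketch assumes that the 5% saturation applies to each end of the luminance histogram separately and that the stretch is linear between the two percentile values; neither detail is stated explicitly, and the function name stretch_luminance is hypothetical.

```python
import numpy as np

def stretch_luminance(lum: np.ndarray, saturation: float = 0.05) -> np.ndarray:
    """Linearly stretch a luminance channel to [0, 255], saturating both tails."""
    # Assumption: `saturation` is the fraction of pixels clipped at EACH end.
    low, high = np.percentile(lum, [100 * saturation, 100 * (1 - saturation)])
    if high <= low:
        return lum.astype(np.float64)  # flat channel: nothing to stretch
    stretched = (lum.astype(np.float64) - low) / (high - low) * 255.0
    return np.clip(stretched, 0.0, 255.0)

# Toy usage: a dark channel occupying only the range [20, 60].
dark = np.linspace(20.0, 60.0, 1001)
out = stretch_luminance(dark)
print(out.min(), out.max())
```

After stretching, the dark channel spans the full output range, with roughly 5% of the pixels saturated at each boundary.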

DeepISP for Well-Lit Images

While the examples above are described with a particular emphasis on low-light images, the DeepISP of the present embodiments can also be used for well-lit images. For this purpose, a similar DeepISP was trained for well-lit images. However, unlike the low-light case, in which a higher quality image was used as the ground truth, in this case only the raw version and its processed JPEG version from the Samsung ISP are available. The DeepISP 10 was therefore trained to mimic the Samsung ISP, having the “well-lit” raw image (captured in fully automatic mode) as the input to the network and the JPEG as its target “ground truth”.

In the training phase, the training procedure, the hyper-parameters, and the number of epochs were the same as for the low-light processing experiment described above. The initial transformation W_(init) for the high-level part was computed for these inputs in the same way as described for the low-light case. The results are shown in FIG. 10. The upper-left triangles in FIG. 10 are the output of DeepISP 10, and the lower-right triangles in FIG. 10 are Samsung's output. FIG. 10 demonstrates that even though in this Example the network was trained in sub-optimal conditions (since the target output was the JPEG image and not a noiseless higher quality image), the DeepISP 10 was able to mimic the ISP and generate pleasant looking images. The images were indistinguishable from the ground truth when examining the full-scale image, and were close to the ground truth when examining details.

This experiment demonstrates that a neural model can learn to mimic an ISP given as a black box. Moreover, based on the good results achieved in this setting, combined with the good low-light processing results achieved when the high-quality ground truth was given, it is envisioned that DeepISP 10 is likely to produce a better output when given a higher-quality ground truth at training.

Shared Features

A modified DeepISP was trained to study the effect of simultaneous learning of low- and high-level corrections. This experiment shows that given the same budget (number of layers and number of parameters), inferior results are obtained when the information is not shared. The results are shown in FIG. 11, where the upper-left triangle is the output of DeepISP 10 in which features of part 12 are shared with features of part 14, and the lower-right triangle is the output of a network in which part 14 receives the output image of part 12 without sharing features between parts 12 and 14. As shown, when features are not shared, the model often fails to generate good-looking colors, and a degraded image quality is obtained.

This Example presented an end-to-end DL procedure that can be used as, or serve as a component in, a full ISP of a digital camera. Unlike conventional techniques that apply DL to individual tasks in the image processing pipeline, the method of the present embodiments performs all the tasks of the pipeline collectively. Such an approach has the advantage of sharing information while performing different tasks. This lowers the computational costs compared to the case in which each processing step is performed independently. The steps that are excluded in this example, but are nevertheless contemplated according to some embodiments of the present invention, are removing camera shake and/or blur, handling HDR images, and adapting the network for various levels of noise.

The DeepISP 10 of the present embodiments demonstrated in this Example its ability to generate visually compelling images from raw low-light images. The output of the Samsung S7 ISP was used as the reference both with low-light and well-lit raw inputs. In the human evaluation of full images, images provided by DeepISP 10 scored about 7% higher than the manufacturer ISP and only about 0.7% below the equivalent well-lit images. Similar trends were observed with the DeepIQA measure.
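The percentages quoted above follow from the full-image MOS values reported for FIG. 9 (3.74 for the Samsung S7 ISP on low-light images, 4.02 for DeepISP and 4.05 for the well-lit images). A short arithmetic check, with illustrative variable names:

```python
mos_samsung_low_light = 3.74
mos_deepisp = 4.02
mos_well_lit = 4.05

# Relative gain of DeepISP over the manufacturer ISP, in percent.
gain_over_isp = (mos_deepisp - mos_samsung_low_light) / mos_samsung_low_light * 100
# Relative gap between DeepISP and the well-lit reference, in percent.
gap_to_well_lit = (mos_well_lit - mos_deepisp) / mos_well_lit * 100

print(f"gain over ISP: {gain_over_isp:.1f}%, gap to well-lit: {gap_to_well_lit:.2f}%")
```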

With respect to the low-level part 12 of DeepISP 10, this Example demonstrated that, in terms of an objective metric (PSNR), the network outperformed the previous state-of-the-art technique by about 0.72 dB (sRGB, Panasonic test set) on the joint denoising and demosaicing task. This Example also demonstrated the ability of DeepISP 10 to generalize, outperforming conventional techniques by 1.28 dB PSNR (sRGB, Canon test set).

In some embodiments of the present invention a network similar to DeepIQA can be used to further improve the perceptual quality of DeepISP 10. It can be used as part of the loss and its gradients can be propagated through the networks. This may serve as an alternative to the conventional adversarial loss.

Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.

It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting. In addition, any priority document(s) of this application is/are hereby incorporated herein by reference in its/their entirety.

REFERENCES

-   [1] M. Gharbi, G. Chaurasia, S. Paris, and F. Durand, “Deep joint demosaicking and denoising,” ACM Transactions on Graphics (TOG), vol. 35, no. 6, p. 191, 2016.
-   [2] T. Klatzer, K. Hammernik, P. Knobelreiter, and T. Pock, “Learning joint demosaicing and denoising based on sequential energy minimization,” in Computational Photography (ICCP), 2016 IEEE International Conference on. IEEE, 2016, pp. 1-11.
-   [3] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein, “Deep class-aware image denoising,” in Sampling Theory and Applications (SampTA), 2017 International Conference on. IEEE, 2017, pp. 138-142.
-   [4] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” arXiv preprint arXiv:1704.03264, 2017.
-   [5] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 2392-2399.
-   [6] W. Feng and Y. Chen, “Fast and accurate poisson denoising with optimized nonlinear diffusion,” arXiv, abs/1510.02930, 2015.
-   [7] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016.
-   [8] R. Vemulapalli, O. Tuzel, and M. Liu, “Deep gaussian conditional random field network: A model-based deep network for discriminative denoising,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-   [9] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142-3155, July 2017.
-   [10] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Scholkopf, “Learning to deblur,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 7, pp. 1439-1451, 2016.
-   [11] L. Xu, J. S. Ren, C. Liu, and J. Jia, “Deep convolutional neural network for image deconvolution,” in NIPS, 2014.
-   [12] J. Sun, W. Cao, Z. Xu, and J. Ponce, “Learning a convolutional neural network for non-uniform motion blur removal,” in CVPR, 2015.
-   [13] A. Chakrabarti, “A neural approach to blind motion deblurring,” in ECCV, 2016.
-   [14] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang, “Deep video deblurring,” in CVPR, 2017.
-   [15] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, 2017.
-   [16] J. Kim, J. K. Lee, and K. M. Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016.
-   [17] J. Bruna, P. Sprechmann, and Y. LeCun, “Super-resolution with deep convolutional sufficient statistics,” in ICLR, 2016.
-   [18] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network,” arXiv preprint arXiv:1609.04802, 2016.
-   [19] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4799-4807.
-   [20] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPR Workshops, 2017.
-   [21] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Transactions on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006.
-   [22] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: Nonlinear Phenomena, vol. 60, no. 1-4, pp. 259-268, 1992.
-   [23] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2. IEEE, 2005, pp. 60-65.
-   [24] Y. Bahat, N. Efrat, and M. Irani, “Non-uniform blind deblurring by reblurring,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3286-3294.
-   [25] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” arXiv preprint arXiv:1711.10925, 2017.
-   [26] L. Shen, Z. Yue, F. Feng, Q. Chen, S. Liu, and J. Ma, “Msr-net: Low-light image enhancement using deep convolutional network,” arXiv preprint arXiv:1711.02488, 2017.
-   [27] A. Ignatov, N. Kobyshev, K. Vanhoey, R. Timofte, and L. Van Gool, “DSLR-quality photos on mobile devices with deep convolutional networks,” arXiv preprint arXiv:1704.02470, 2017.
-   [28] C. Li, J. Guo, and C. Guo, “Emerging from water: Underwater image color correction based on weakly supervised color transfer,” arXiv preprint arXiv:1710.07084, 2017.
-   [29] D. Khashabi, S. Nowozin, J. Jancsary, and A. W. Fitzgibbon, “Joint demosaicing and denoising via learned nonparametric random fields,” IEEE Transactions on Image Processing, vol. 23, no. 12, pp. 4968-4981, 2014.
-   [30] S. Farsiu, M. Elad, and P. Milanfar, “Multiframe demosaicing and superresolution of color images,” IEEE Transactions on Image Processing, vol. 15, no. 1, pp. 141-159, 2006.
-   [31] M. Gharbi, J. Chen, J. T. Barron, S. W. Hasinoff, and F. Durand, “Deep bilateral learning for real-time image enhancement,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 118, 2017.
-   [32] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” arXiv preprint arXiv:1805.01934, 2018.
-   [33] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47-57, 2017.
-   [34] H. S. Malvar, L.-w. He, and R. Cutler, “High-quality linear interpolation for demosaicing of bayer-patterned color images,” in Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP'04). IEEE International Conference on, vol. 3. IEEE, 2004, pp. iii-485.
-   [35] Y. M. Lu, M. Karzand, and M. Vetterli, “Demosaicking by alternating projections: theory and fast one-step implementation,” IEEE Transactions on Image Processing, vol. 19, no. 8, pp. 2085-2098, 2010.
-   [36] C.-Y. Su, “Highly effective iterative demosaicing using weighted-edge and color-difference interpolations,” IEEE Transactions on Consumer Electronics, vol. 52, no. 2, pp. 639-645, 2006.
-   [37] A. Buades, B. Coll, J.-M. Morel, and C. Sbert, “Self-similarity driven color demosaicking,” IEEE Transactions on Image Processing, vol. 18, no. 6, pp. 1192-1202, 2009.
-   [38] L. Zhang and X. Wu, “Color demosaicking via directional linear minimum mean square-error estimation,” IEEE Transactions on Image Processing, vol. 14, no. 12, pp. 2167-2178, 2005.
-   [39] D. Paliy, V. Katkovnik, R. Bilcu, S. Alenius, and K. Egiazarian, “Spatially adaptive color filter array interpolation for noiseless and noisy data,” International Journal of Imaging Systems and Technology, vol. 17, no. 3, pp. 105-122, 2007.
-   [40] P. Getreuer, “Contour stencils for edge-adaptive image interpolation,” in Proc. SPIE, vol. 7246, 2009, pp. 323-343.
-   [41] K. Chang, P. L. K. Ding, and B. Li, “Color image demosaicking using inter-channel correlation and nonlocal self-similarity,” Signal Processing: Image Communication, vol. 39, pp. 264-279, 2015.
-   [42] F. Heide, M. Steinberger, Y.-T. Tsai, M. Rouf, D. Pajak, D. Reddy, O. Gallo, J. Liu, W. Heidrich, K. Egiazarian et al., “Flexisp: A flexible camera image processing framework,” ACM Transactions on Graphics (TOG), vol. 33, no. 6, p. 231, 2014.
-   [43] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” arXiv:1512.03385, 2015.
-   [44] S. Bosse, D. Maniry, K. Muller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 206-219, January 2018.

What is claimed is:
 1. A method of processing an input image, comprising: receiving the input image, and storing the image in a memory; by an image processor: extracting low-level features from the input image; generating a residual image devoid of low-level features; extracting high-level features based on said low-level features; applying a transformation to said residual image based on said extracted high-level features; and generating on a display device an output showing said transformed residual image.
 2. The method of claim 1, wherein said low-level features comprise denoising features and demosaicing features.
 3. The method of claim 1, wherein the input image is a raw image.
 4. The method of claim 1, wherein said input image is a demosaiced image.
 5. The method of claim 1, further comprising preprocessing said image by applying a bilinear interpolation, prior to said feeding.
 6. The method of claim 1, wherein said transformation comprises a non-linear function of color components of each pixel of said residual image.
 7. The method of claim 1, wherein said applying said transformation is executed globally to all pixels of said residual image.
 8. The method of claim 1, further comprising capturing the input image.
 9. An image capturing and processing system, comprising: an imaging device for capturing an image; and a hardware image processor configured to receive said captured image, to extract low-level features from said captured image, to generate a residual image devoid of low-level features, to extract high-level features based on said low-level features, to apply a transformation to said residual image based on said extracted high-level features, and to generate on a display device an output showing said transformed residual image.
 10. The system of claim 9, further comprising said display device.
 11. The system of claim 9, wherein said low-level features comprise denoising features and demosaicing features.
 12. The system of claim 9, wherein said image processor is configured to apply to said captured image at least one low-level image processing procedure.
 13. The system of claim 12, wherein said at least one low-level image processing procedure comprises demosaicing.
 14. The system of claim 12, wherein said at least one low-level image processing procedure comprises denoising.
 15. The system of claim 9, wherein said image processor is configured to preprocess said captured image by applying a bilinear interpolation, prior to said extraction of said low-level features.
 16. The system of claim 9, wherein said transformation comprises a non-linear function of color components of each pixel of said residual image.
 17. The system of claim 9, wherein said image processor is configured to apply said transformation globally to all pixels of said residual image.
 18. The system of claim 9, wherein said image processor is configured to apply at least one image processing task selected from the group consisting of super-resolution, deblurring, and image enhancement.
 19. A smartphone, comprising the system of claim 9.
 20. A tablet, comprising the system of claim 9.
 21. A smartwatch, comprising the system of claim 9.
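
By way of non-limiting illustration only, the transformation recited in claims 6, 7, 16 and 17 (a non-linear function of the color components of each pixel, applied globally to all pixels of the residual image) may be sketched as follows. The quadratic form, the 3x10 coefficient matrix `weights`, and the function names are assumptions made for this sketch and are not recited in the claims. In the claimed method the coefficients would be derived from the extracted high-level features; here they are simply passed in as an argument.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def quadratic_features(pixel: Pixel) -> List[float]:
    """Return the 10 upper-triangular products of [r, g, b, 1] with itself.

    Order: r*r, r*g, r*b, r, g*g, g*b, g, b*b, b, 1.
    The quadratic form is an assumption of this sketch.
    """
    v = (*pixel, 1.0)
    return [v[i] * v[j] for i in range(4) for j in range(i, 4)]


def apply_global_transform(
    residual: Sequence[Sequence[Pixel]],
    weights: Sequence[Sequence[float]],
) -> List[List[Pixel]]:
    """Apply the same 3x10 coefficient matrix to every pixel of the image.

    `weights` is hypothetical: it stands in for coefficients that would be
    predicted from the high-level features.
    """
    out = []
    for row in residual:
        out_row = []
        for pixel in row:
            feats = quadratic_features(pixel)
            out_row.append(
                tuple(sum(w * f for w, f in zip(w_row, feats)) for w_row in weights)
            )
        out.append(out_row)
    return out
```

Because a single coefficient matrix is shared by all pixels, the transformation is global. It is non-linear in the color components whenever any coefficient of a second-order term is non-zero.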